{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example of Textual Augmenter Usage<a class=\"anchor\" id=\"home\"></a>:\n",
    "* [Character Augmenter](#chara_aug)\n",
    "    * [OCR](#ocr_aug)\n",
    "    * [Keyboard](#keyboard_aug)\n",
    "    * [Random](#random_aug)\n",
    "* [Word Augmenter](#word_aug)\n",
    "    * [Spelling](#spelling_aug)\n",
    "    * [Word Embeddings](#word_embs_aug)\n",
    "    * [TF-IDF](#tfidf_aug)\n",
    "    * [Contextual Word Embeddings](#context_word_embs_aug)\n",
    "    * [Synonym](#synonym_aug)\n",
    "    * [Antonym](#antonym_aug)\n",
    "    * [Random Word](#random_word_aug)\n",
    "    * [Split](#split_aug)\n",
    "    * [Back Translatoin](#back_translation_aug)\n",
    "    * [Reserved Word](#reserved_aug)\n",
    "* [Sentence Augmenter](#sent_aug)\n",
    "    * [Contextual Word Embeddings for Sentence](#context_word_embs_sentence_aug)\n",
    "    * [Abstractive Summarization](#abst_summ_aug)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ[\"MODEL_DIR\"] = '../model'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nlpaug.augmenter.char as nac\n",
    "import nlpaug.augmenter.word as naw\n",
    "import nlpaug.augmenter.sentence as nas\n",
    "import nlpaug.flow as nafc\n",
    "\n",
    "from nlpaug.util import Action"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The quick brown fox jumps over the lazy dog .\n"
     ]
    }
   ],
   "source": [
    "text = 'The quick brown fox jumps over the lazy dog .'\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Character Augmenter<a class=\"anchor\" id=\"chara_aug\">\n",
    "\n",
    "Augmenting data in character level. Possible scenarios include image to text and chatbot. During recognizing text from image, we need to optical character recognition (OCR) model to achieve it but OCR introduces some errors such as recognizing \"o\" and \"0\". `OCRAug` simulate these errors to perform the data augmentation. For chatbot, we still have typo even though most of application comes with word correction. Therefore, `KeyboardAug` is introduced to simulate this kind of errors."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### OCR Augmenter<a class=\"anchor\" id=\"ocr_aug\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Substitute character by pre-defined OCR error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Texts:\n",
      "['The quick bkown fox jumps ovek the lazy dog .', 'The quick 6rown fox jumps ovek the lazy dog .', 'The quick brown f0x jomps over the la2y dog .']\n"
     ]
    }
   ],
   "source": [
    "aug = nac.OcrAug()\n",
    "augmented_texts = aug.augment(text, n=3)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Texts:\")\n",
    "print(augmented_texts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Keyboard Augmenter<a class=\"anchor\" id=\"keyboard_aug\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Substitute character by keyboard distance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Text:\n",
      "The quick brown Gox juJps ocer the lazy dog .\n"
     ]
    }
   ],
   "source": [
    "aug = nac.KeyboardAug()\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Random Augmenter<a class=\"anchor\" id=\"random_aug\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Insert character randomly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog\n",
      "Augmented Text:\n",
      "T3he quicNk @brown fEox juamps $over th6e la1zy d*og\n"
     ]
    }
   ],
   "source": [
    "aug = nac.RandomCharAug(action=\"insert\")\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Substitute character randomly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog\n",
      "Augmented Text:\n",
      "ThN qDick brow0 foB jumks oveE t+e laz6 dBg\n"
     ]
    }
   ],
   "source": [
    "aug = nac.RandomCharAug(action=\"substitute\")\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Swap character randomly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog\n",
      "Augmented Text:\n",
      "Hte quikc borwn fxo jupms ovre teh lzay dgo\n"
     ]
    }
   ],
   "source": [
    "aug = nac.RandomCharAug(action=\"swap\")\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Delete character randomly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog\n",
      "Augmented Text:\n",
      "Te quic rown fx jump ver he laz og\n"
     ]
    }
   ],
   "source": [
    "aug = nac.RandomCharAug(action=\"delete\")\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Word Augmenter<a class=\"anchor\" id=\"word_aug\"></a>\n",
    "\n",
    "Besides character augmentation, word level is important as well. We make use of word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), fasttext (Joulin et al., 2016), BERT(Devlin et al., 2018) and wordnet to insert and substitute similar word. `Word2vecAug`,  `GloVeAug` and `FasttextAug` use word embeddings to find most similar group of words to replace original word. On the other hand, `BertAug` use language models to predict possible target word. `WordNetAug` use statistics way to find the similar group of words."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Spelling Augmenter<a class=\"anchor\" id=\"spelling_aug\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Substitute word by spelling mistake words dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Texts:\n",
      "['Tha qchick brown fox jumps ower the lazy dog.', 'Their quick borwn fox jumps over tge lazy dog.', 'The qchick brown fox jumps ower the lazy dod.']\n"
     ]
    }
   ],
   "source": [
    "aug = naw.SpellingAug()\n",
    "augmented_texts = aug.augment(text, n=3)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Texts:\")\n",
    "print(augmented_texts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Texts:\n",
      "['They quick browb fox jumps over se lazy dog.', 'The quikly brown fox jumps over tge lazy dod.', 'Tha quick brown fox jumps ower their lazy dog.']\n"
     ]
    }
   ],
   "source": [
    "aug = naw.SpellingAug()\n",
    "augmented_texts = aug.augment(text, n=3)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Texts:\")\n",
    "print(augmented_texts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Word Embeddings Augmenter<a class=\"anchor\" id=\"word_embs_aug\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Insert word randomly by word embeddings similarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog\n",
      "Augmented Text:\n",
      "The quick brown fox jumps Alzeari over the lazy Superintendents dog\n"
     ]
    }
   ],
   "source": [
    "# model_type: word2vec, glove or fasttext\n",
    "aug = naw.WordEmbsAug(\n",
    "    model_type='word2vec', model_path=model_dir+'GoogleNews-vectors-negative300.bin',\n",
    "    action=\"insert\")\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Substitute word by word2vec similarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog\n",
      "Augmented Text:\n",
      "The easy brown fox jumps around the lazy dog\n"
     ]
    }
   ],
   "source": [
    "# model_type: word2vec, glove or fasttext\n",
    "aug = naw.WordEmbsAug(\n",
    "    model_type='word2vec', model_path=model_dir+'GoogleNews-vectors-negative300.bin',\n",
    "    action=\"substitute\")\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TF-IDF Augmenter<a class=\"anchor\" id=\"tfidf_aug\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Insert word by TF-IDF similarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog\n",
      "Augmented Text:\n",
      "sinks The quick brown fox jumps over the lazy Sidney dog\n"
     ]
    }
   ],
   "source": [
    "aug = naw.TfIdfAug(\n",
    "    model_path=os.environ.get(\"MODEL_DIR\"),\n",
    "    action=\"insert\")\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Substitute word by TF-IDF similarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog\n",
      "Augmented Text:\n",
      "The quick brown fox Baked over the polygraphy dog\n"
     ]
    }
   ],
   "source": [
    "aug = naw.TfIdfAug(\n",
    "    model_path=os.environ.get(\"MODEL_DIR\"),\n",
    "    action=\"substitute\")\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Contextual Word Embeddings Augmenter<a class=\"anchor\" id=\"context_word_embs_aug\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Insert word by contextual word embeddings (BERT, DistilBERT, RoBERTA or XLNet)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog\n",
      "Augmented Text:\n",
      "even the quick brown fox usually jumps over the lazy dog\n"
     ]
    }
   ],
   "source": [
    "aug = naw.ContextualWordEmbsAug(\n",
    "    model_path='bert-base-uncased', action=\"insert\")\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Substitute word by contextual word embeddings (BERT, DistilBERT, RoBERTA or XLNet)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog\n",
      "Augmented Text:\n",
      "little quick brown fox jumps over the lazy dog\n"
     ]
    }
   ],
   "source": [
    "aug = naw.ContextualWordEmbsAug(\n",
    "    model_path='bert-base-uncased', action=\"substitute\")\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Text:\n",
      "the striped brown fox jumps over the muddy grass .\n"
     ]
    }
   ],
   "source": [
    "aug = naw.ContextualWordEmbsAug(\n",
    "    model_path='distilbert-base-uncased', action=\"substitute\")\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Text:\n",
      "The quick brown fox jumps Into the bull dog .\n"
     ]
    }
   ],
   "source": [
    "aug = naw.ContextualWordEmbsAug(\n",
    "    model_path='roberta-base', action=\"substitute\")\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Synonym Augmenter<a class=\"anchor\" id=\"synonym_aug\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Substitute word by WordNet's synonym"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Text:\n",
      "The speedy brown fox jumps complete the lazy dog .\n"
     ]
    }
   ],
   "source": [
    "aug = naw.SynonymAug(aug_src='wordnet')\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Substitute word by PPDB's synonym"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Text:\n",
      "The quick brown fox climbs over the lazy dog .\n"
     ]
    }
   ],
   "source": [
    "aug = naw.SynonymAug(aug_src='ppdb', model_path=os.environ.get(\"MODEL_DIR\") + 'ppdb-2.0-s-all')\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Antonym Augmenter<a class=\"anchor\" id=\"antonym_aug\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Substitute word by antonym"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "Good boy\n",
      "Augmented Text:\n",
      "Good daughter\n"
     ]
    }
   ],
   "source": [
    "aug = naw.AntonymAug()\n",
    "_text = 'Good boy'\n",
    "augmented_text = aug.augment(_text)\n",
    "print(\"Original:\")\n",
    "print(_text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Random Word Augmenter<a class=\"anchor\" id=\"random_word_aug\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Swap word randomly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Text:\n",
      "Quick the brown fox jumps over the lazy dog .\n"
     ]
    }
   ],
   "source": [
    "aug = naw.RandomWordAug(action=\"swap\")\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Delete word randomly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog\n",
      "Augmented Text:\n",
      "The brown jumps over the lazy dog\n"
     ]
    }
   ],
   "source": [
    "aug = naw.RandomWordAug()\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Delete a set of contunous word will be removed randomly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Text:\n",
      "The quick brown fox jumps dog .\n"
     ]
    }
   ],
   "source": [
    "aug = naw.RandomWordAug(action='crop')\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Split Augmenter<a class=\"anchor\" id=\"split_aug\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Split word to two tokens randomly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Text:\n",
      "The q uick b rown fox jumps o ver the lazy dog .\n"
     ]
    }
   ],
   "source": [
    "aug = naw.SplitAug()\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Back Translation Augmenter<a class=\"anchor\" id=\"back_translation_aug\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The speedy brown fox jumped over the lazy dog'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import nlpaug.augmenter.word as naw\n",
    "\n",
    "text = 'The quick brown fox jumped over the lazy dog'\n",
    "back_translation_aug = naw.BackTranslationAug(\n",
    "    from_model_name='facebook/wmt19-en-de', \n",
    "    to_model_name='facebook/wmt19-de-en'\n",
    ")\n",
    "back_translation_aug.augment(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The speedy brown fox jumped over the lazy dog'"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load models from local path\n",
    "import nlpaug.augmenter.word as naw\n",
    "\n",
    "from_model_dir = os.path.join(os.environ[\"MODEL_DIR\"], 'word', 'fairseq', 'wmt19.en-de')\n",
    "to_model_dir = os.path.join(os.environ[\"MODEL_DIR\"], 'word', 'fairseq', 'wmt19.de-en')\n",
    "\n",
    "text = 'The quick brown fox jumped over the lazy dog'\n",
    "back_translation_aug = naw.BackTranslationAug(\n",
    "    from_model_name=from_model_dir, from_model_checkpt='model1.pt',\n",
    "    to_model_name=to_model_dir, to_model_checkpt='model1.pt', \n",
    "    is_load_from_github=False)\n",
    "back_translation_aug.augment(text)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reserved Word Augmenter<a class=\"anchor\" id=\"reserved_aug\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nlpaug.augmenter.word as naw\n",
    "\n",
    "text = 'Fwd: Mail for solution'\n",
    "reserved_tokens = [\n",
    "    ['FW', 'Fwd', 'F/W', 'Forward'],\n",
    "]\n",
    "reserved_aug = naw.ReservedAug(reserved_tokens=reserved_tokens)\n",
    "augmented_text = reserved_aug.augment(text)\n",
    "\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sentence Augmentation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Contextual Word Embeddings for Sentence Augmenter<a class=\"anchor\" id=\"context_word_embs_sentence_aug\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Insert sentence by contextual word embeddings (GPT2 or XLNet)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Texts:\n",
      "['The quick brown fox jumps over the lazy dog . A terrible , messy split second presents itself to the heart - which is we lose our heart.', 'The quick brown fox jumps over the lazy dog . Cast from the heart - the above flash is insight to the heart.', 'The quick brown fox jumps over the lazy dog . Give two mom s time to share some affection over this heart shaped version of Scott.']\n"
     ]
    }
   ],
   "source": [
    "# model_path: xlnet-base-cased or gpt2\n",
    "aug = nas.ContextualWordEmbsForSentenceAug(model_path='xlnet-base-cased')\n",
    "augmented_texts = aug.augment(text, n=3)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Texts:\")\n",
    "print(augmented_texts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Text:\n",
      "The quick brown fox jumps over the lazy dog . J in a Better Balls of Fire cameo on St iring.\n"
     ]
    }
   ],
   "source": [
    "aug = nas.ContextualWordEmbsForSentenceAug(model_path='gpt2')\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Text:\n",
      "The quick brown fox jumps over the lazy dog . They start shooting wildly.\n"
     ]
    }
   ],
   "source": [
    "aug = nas.ContextualWordEmbsForSentenceAug(model_path='gpt2')\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "The quick brown fox jumps over the lazy dog .\n",
      "Augmented Text:\n",
      "The quick brown fox jumps over the lazy dog . She keeps running around the house.\n"
     ]
    }
   ],
   "source": [
    "aug = nas.ContextualWordEmbsForSentenceAug(model_path='distilgpt2')\n",
    "augmented_text = aug.augment(text)\n",
    "print(\"Original:\")\n",
    "print(text)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Abstractive Summarization Augmenter<a class=\"anchor\" id=\"abst_summ_aug\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original:\n",
      "\n",
      "The history of natural language processing (NLP) generally started in the 1950s, although work can be \n",
      "found from earlier periods. In 1950, Alan Turing published an article titled \"Computing Machinery and \n",
      "Intelligence\" which proposed what is now called the Turing test as a criterion of intelligence. \n",
      "The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian \n",
      "sentences into English. The authors claimed that within three or five years, machine translation would\n",
      "be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, \n",
      "which found that ten-year-long research had failed to fulfill the expectations, funding for machine \n",
      "translation was dramatically reduced. Little further research in machine translation was conducted \n",
      "until the late 1980s when the first statistical machine translation systems were developed.\n",
      "\n",
      "Augmented Text:\n",
      "the history of natural language processing (NLP) generally started in the 1950s. work can be found from earlier periods, such as the Georgetown experiment in 1954. little further research in machine translation was conducted until the late 1980s \n"
     ]
    }
   ],
   "source": [
    "article = \"\"\"\n",
    "The history of natural language processing (NLP) generally started in the 1950s, although work can be \n",
    "found from earlier periods. In 1950, Alan Turing published an article titled \"Computing Machinery and \n",
    "Intelligence\" which proposed what is now called the Turing test as a criterion of intelligence. \n",
    "The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian \n",
    "sentences into English. The authors claimed that within three or five years, machine translation would\n",
    "be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, \n",
    "which found that ten-year-long research had failed to fulfill the expectations, funding for machine \n",
    "translation was dramatically reduced. Little further research in machine translation was conducted \n",
    "until the late 1980s when the first statistical machine translation systems were developed.\n",
    "\"\"\"\n",
    "\n",
    "aug = nas.AbstSummAug(model_path='t5-base', num_beam=3)\n",
    "augmented_text = aug.augment(article)\n",
    "print(\"Original:\")\n",
    "print(article)\n",
    "print(\"Augmented Text:\")\n",
    "print(augmented_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "nlpaug_dev",
   "language": "python",
   "name": "nlpaug_dev"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
